Berk, Brown and Zhao (2010)
Classical regression analysis treats predictors as fixed (non-random).
This assumption allows unbiased estimation of the regression coefficients.
Berk and colleagues (2009) argue that model selection is ubiquitous in criminology.
When the correct model is unknown, the data are often used to select one.
But what happens when the same data set is used both to select the model and to draw inferences from it?
Sampling distributions become distorted
Estimates are biased (e.g., regression coefficients, standard errors)
Overconfidence in results (inflated Type I error)
Direct Censoring
Indirect Censoring
Alterations in dispersion of regression parameter estimates
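Direct censoring is easy to demonstrate. The sketch below is not from Berk et al. (the sample size, replication count, and seed are all illustrative): a predictor whose true coefficient is zero is retained by stepwise AIC only when its estimate happens to be large, so tests conditional on inclusion reject far more often than the nominal 5%.

```r
set.seed(42)
reps <- 2000          # illustrative number of replications
n <- 100              # illustrative sample size
incl <- rej <- logical(reps)
for (i in seq_len(reps)) {
  d <- data.frame(x = rnorm(n), y = rnorm(n))      # true coefficient of x is 0
  sel <- step(lm(y ~ x, data = d), trace = FALSE)  # AIC keeps x only when |t| is large
  incl[i] <- "x" %in% names(coef(sel))
  if (incl[i]) rej[i] <- summary(sel)$coefficients["x", 4] < 0.05
}
mean(rej[incl])   # Type I error conditional on inclusion: well above the nominal 0.05
```

The selection step censors the small estimates of `x`, so the surviving ones are systematically too extreme.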
```r
set.seed(1)                         # for reproducibility
reps <- 10000                       # number of simulated data sets
p <- 3                              # number of predictors
Sigma <- matrix(c(5, 4, 5,
                  4, 6, 5,
                  5, 5, 7), p, p)   # covariance matrix of the predictors
n <- 200                            # sample size
betas <- c(3, 0, 1, 2)              # intercept, then true coefficients of w, x, z

rsq <- numeric(reps)
coefs <- cover <- matrix(NA, nrow = reps, ncol = p)
colnames(coefs) <- colnames(cover) <- c("w", "x", "z")

for (i in seq_len(reps)) {
  X <- MASS::mvrnorm(n = n, mu = rep(0, p), Sigma = Sigma)
  y <- as.numeric(cbind(1, X) %*% betas + rnorm(n, 0, 10))
  Xy <- as.data.frame(cbind(X, y))
  colnames(Xy) <- c("w", "x", "z", "y")
  fit <- lm(y ~ ., data = Xy)
  sel <- step(fit, k = 2, trace = FALSE)   # stepwise selection by AIC
  s <- summary(sel)
  tvals <- s$coefficients[, 3][-1]         # t-values of the selected predictors
  coefs[i, names(tvals)] <- tvals          # NA for predictors that were dropped
  ci <- confint(sel)[-1, , drop = FALSE]   # 95% CIs of the selected predictors
  truth <- betas[-1][match(rownames(ci), colnames(cover))]
  cover[i, rownames(ci)] <- truth >= ci[, 1] & truth <= ci[, 2]
  rsq[i] <- s$r.squared
}
```

| Model selected (%) | None | W | X | Z | WX | WZ | XZ | WXZ |
|---|---|---|---|---|---|---|---|---|
| Berk et al. | 0 | 0 | 0.0001 | 17.4 | 1.0 | 4.9 | 65.7 | 10.8 |
| Replication | 0 | 0 | 0.04 | 17.94 | 1.01 | 4.8 | 65.42 | 10.79 |
The \(R^2\)s varied over the simulations between about .3 and .4.
For \(X\), the post-model-selection distribution has a larger mean (2.6 vs. 2.2) and a smaller standard deviation (0.79 vs. 1.0).
For \(Z\), both the mean and the standard deviation are biased substantially upward: from 4.9 to 5.5 for the mean and from 1.0 to 2.3 for the standard deviation.
| Predictor | Coverage (95% CI) | Estimate | t-value | Inclusion (%) | Bias | MSE |
|---|---|---|---|---|---|---|
| W | 0.713 | 0.114 | 0.257 | 16.60 | -1.391 | 4.599 |
| X | 0.970 | 1.176 | 2.584 | 77.26 | -0.328 | 1.472 |
| Z | 0.959 | 2.071 | 5.413 | 98.95 | 0.570 | 1.868 |
Red curve / solid line = conditional on the preferred model being known
Blue curve / dashed line = conditional on the predictor being included in a model
For the preferred model, power to reject \(H_0: \beta_2=0\) with \(\alpha =0.05\) is approximately 60%.
After model selection, that probability is about 76%.
Bias due to model selection thus artificially inflates power by about 27%.
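The same pattern can be reproduced in a few lines. This is not the slides' simulation (the effect sizes, noise level, and sample size below are invented), but it shows the same phenomenon: power computed conditional on a predictor surviving stepwise selection exceeds the honest power of the preferred model.

```r
set.seed(7)
reps <- 2000; n <- 100               # illustrative values
rej_known <- rej_sel <- incl <- logical(reps)
for (i in seq_len(reps)) {
  d <- data.frame(w = rnorm(n), x = rnorm(n), z = rnorm(n))
  d$y <- 0.25 * d$x + 0.5 * d$z + rnorm(n)   # w is noise; x has a modest effect
  # honest power: the preferred model (x and z) is known in advance
  rej_known[i] <- summary(lm(y ~ x + z, data = d))$coefficients["x", 4] < 0.05
  # post-selection power: condition on x surviving stepwise AIC selection
  sel <- step(lm(y ~ ., data = d), trace = FALSE)
  incl[i] <- "x" %in% names(coef(sel))
  if (incl[i]) rej_sel[i] <- summary(sel)$coefficients["x", 4] < 0.05
}
mean(rej_known)      # power when the preferred model is fit directly
mean(rej_sel[incl])  # larger: selection censors the small estimates of x
```

The gap arises because the replications in which the estimate of `x` happened to be small are exactly the ones in which `x` is dropped, so they never enter the conditional power calculation.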
Selecting the preferred model (\(X\) and \(Z\) included) does not guarantee any of the desirable properties of the regression coefficient estimates.
Post-model-selection statistical inference can lead to biased regression parameter estimates and seriously misleading statistical tests and confidence intervals.
The particular selection procedure used does not materially matter.
Sometimes the correct sampling distribution and the post-model-selection sampling distribution will be very similar.
Split the sample into training and test samples (😦)
Collect two random samples (😨)
Derive a theoretically based appropriate model (😰)
Differentiate between confirmatory and exploratory analysis (🤯)
Should all else fail, forego formal statistical inference altogether (☠️)
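The first remedy takes only a few lines of R. A minimal sketch, with an invented data-generating process: the model is chosen on the training half only, and all formal inference uses the held-out half, so the resulting p-values and confidence intervals retain their nominal properties.

```r
set.seed(1)
n <- 200
d <- data.frame(w = rnorm(n), x = rnorm(n), z = rnorm(n))
d$y <- 1 * d$x + 2 * d$z + rnorm(n, 0, 3)            # w is pure noise
train <- d[1:(n / 2), ]
test  <- d[(n / 2 + 1):n, ]
sel <- step(lm(y ~ ., data = train), trace = FALSE)  # select only on the training half
final <- lm(formula(sel), data = test)               # honest inference on the test half
summary(final)                                       # tests/CIs unaffected by selection
```

The cost, of course, is that each step uses only half the data, which is why the remedy earns its sad face.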
Model averaging (Lukacs et al., 2019)
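A minimal sketch of Akaike-weight model averaging in the spirit of that reference (the data and effect sizes below are invented): instead of conditioning on a single selected model, every candidate model is fit and coefficients are averaged with weights proportional to \(\exp(-\Delta\mathrm{AIC}/2)\), propagating model uncertainty rather than discarding it.

```r
set.seed(3)
n <- 200
d <- data.frame(w = rnorm(n), x = rnorm(n), z = rnorm(n))
d$y <- 1 * d$x + 2 * d$z + rnorm(n, 0, 3)
preds <- c("w", "x", "z")
# all 2^3 = 8 candidate models, including the intercept-only model
forms <- c("y ~ 1",
           unlist(lapply(1:3, function(k)
             combn(preds, k,
                   function(v) paste("y ~", paste(v, collapse = " + ")),
                   simplify = FALSE))))
fits <- lapply(forms, function(f) lm(as.formula(f), data = d))
aic  <- sapply(fits, AIC)
wts  <- exp(-(aic - min(aic)) / 2)
wts  <- wts / sum(wts)                     # Akaike weights
# model-averaged coefficient of x (0 in models where x is absent)
bx <- sapply(fits, function(m) if ("x" %in% names(coef(m))) coef(m)[["x"]] else 0)
sum(wts * bx)
```

Models that fit poorly receive weights near zero, so the average is dominated by the plausible models without ever committing to one of them.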
Try my ShinyApp!
Statistical Inference After Model Selection